Methods, apparatus and computer program products for information retrieval and document
classification utilizing a multidimensional subspace



(57) Methods, apparatus and computer program products are provided for retrieving information from a
text data collection and for classifying a document into none, one or more of a plurality of predefined
classes. In each aspect, a representation of at least a portion of the original matrix is projected into a
lower dimensional subspace and those portions of the subspace representation that relate to the term(s) of
the query are weighted following the projection into the lower dimensional subspace. In order to retrieve
the documents that are most relevant with respect to a query, the documents are then scored, with documents
having better scores being of generally greater relevance. Alternatively, in order to classify a document,
the relationship of the document to the classes of documents is scored, with the document then being
classified in those classes, if any, that have the best scores.
Description

CROSS-REFERENCE TO RELATED APPLICATIONS 

[0001] The present application is a continuation-in-part of U.S. Patent Application No. 09/328,888 entitled METHOD
AND SYSTEM FOR TEXT MINING USING MULTIDIMENSIONAL SUBSPACES filed June 9, 1999 by D. Dean Billheimer
et al. (hereinafter "the '888 application"). The contents of the '888 application are hereby incorporated by
reference in their entirety.

FIELD OF THE INVENTION 

[0002] The present invention relates generally to text mining and, more particularly, to retrieving information and
classifying documents in an efficient and effective manner by utilizing multidimensional subspaces to represent
semantic relationships that exist in a set of documents.

BACKGROUND OF THE INVENTION 

[0003] Text mining is an extension of the general notion of data mining in the area of free or semi-structured text.
Data mining broadly seeks to expose patterns and trends in data, and most data mining techniques are sophisticated
methods for analyzing relationships among highly formatted data, i.e., numerical data or data with a relatively small
fixed number of possible values. However, much of the knowledge associated with an enterprise consists of
textually-expressed information, including free text fields in databases, reports and other documents generated in
the company, memos, e-mail, Web sites, and external news articles used by managers, market analysts, and researchers.
This data is inaccessible to traditional data mining techniques, because these techniques cannot handle the
unstructured or semi-structured nature of free text. Similarly, the analysis task is beyond the capabilities of
traditional document management systems and databases. Text mining is therefore a developing field devoted to helping
knowledge workers find relationships between individual unstructured or semi-structured text documents and semantic
patterns across large collections of such documents.

[0004] Research in text mining has its roots in information retrieval. Initial information retrieval work began around
1960 when researchers started to systematically explore methods to match users' queries to documents in a database.
However, recent advances in computer storage capacity and processing power coupled with massive increases in the
amount of text available on-line have resulted in a new emphasis on applying techniques learned from information
retrieval to a wider range of text mining problems. Concurrently, text mining has grown from its origins in simple
information retrieval systems to encompass additional operations including: information visualization; document
classification and clustering; routing and filtering; document summarization; and document cross-referencing. All of
the text mining operations listed above share the common need to automatically assess and characterize the similarity
between two or more pieces of text. This need is most obvious in information retrieval.

[0005] All information retrieval methods depend upon the twin concepts of document and term. A document refers
to any body of free or semi-structured text that a user is interested in getting information about in his or her text
mining application. This text can be the entire content of a physical or electronic document, an abstract, a paragraph,
or even a title. "Document" also encompasses text generated from images and graphics or text recovered from audio and
video objects. Ideally, a document describes a coherent topic. All documents are represented as collections of terms,
and individual terms can appear in multiple documents. Typically, a term is a single word that is used in the text.
However, a term can also refer to several words that are commonly used together, for example, "landing gear." In
addition, the terms that represent a piece of text may not appear explicitly in the text; a document's terms may be
obtained by applying acronym and abbreviation expansion, word stemming, spelling normalization, thesaurus-based
substitutions, or many other techniques. Obtaining the best set of terms for a given document is dependent upon the
document or the collection to which the document belongs and the particular goal of the text mining activity. Once a
suitable set of documents and terms has been defined for a text collection, various information retrieval techniques
can be applied to the collection. These techniques can be grouped into four broad categories: keyword search methods,
natural language understanding methods, probabilistic methods, and vector space methods. Each category, as well as its
relative advantages and disadvantages, is discussed in the '888 application and reference is made to the '888
application for further information.

[0006] With respect to traditional vector space methods, individual documents are treated as vectors in a
high-dimensional vector space in which each dimension corresponds to some feature of a document. A collection of
documents can therefore be represented by a two-dimensional matrix $D$ of features and documents. In the typical case,
the features correspond to document terms, and the value of each term is the frequency of that term in the specified
document. For example, if term $t_1$ occurs four times in document $d_1$, then $D_{1,1}$ is set to 4. Similarly, if
term $t_2$ does not occur in $d_1$, then $D_{2,1}$ is set to 0. More complex types of vector space methods, such as
latent semantic indexing (LSI), involve ways of transforming $D$, e.g., singular value decomposition (SVD) or
semi-discrete decomposition (SDD), which typically attempt to provide a more sophisticated set of features and a
better measure of the importance of each feature in a document.
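
The construction of a term-by-document frequency matrix described above can be sketched as follows. This is a minimal illustration only: the whitespace tokenizer, the function name `term_document_matrix`, and the two sample documents are assumptions made for the example and do not come from the application.

```python
from collections import Counter

def term_document_matrix(documents):
    """Return (terms, D) where D[i][j] is the raw count of term i in document j."""
    # Assumption: terms are lower-cased, whitespace-separated words.
    counts = [Counter(doc.lower().split()) for doc in documents]
    terms = sorted(set().union(*counts))
    # A Counter returns 0 for a term that does not occur in a document.
    D = [[c[t] for c in counts] for t in terms]
    return terms, D

# Hypothetical two-document collection.
docs = ["engine idle engine engine engine", "landing gear idle"]
terms, D = term_document_matrix(docs)
```

Here the row for "engine" holds 4 in the first column and 0 in the second, mirroring the example in which a term that occurs four times yields an entry of 4 and an absent term yields 0.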

[0007] By representing documents as vectors in a feature space, similarity between documents can be evaluated
by computing the distance between the vectors representing the documents. A cosine measure is commonly used for
this purpose, but other distance measures can be used. To use the vector space method for information retrieval, a
user's query is treated as a pseudo-document and is represented as a vector in the same space as the document
vectors. The distance between the query vector and each of the document vectors is computed, and the documents
that are closest to the query are retrieved.
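
A minimal sketch of retrieval with a cosine measure, treating the query as a pseudo-document in the same term space as the documents, might look like the following. The function names and the small vectors are hypothetical and chosen only to illustrate the ranking step.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_documents(query, doc_vectors):
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine(query, d) for d in doc_vectors]
    return sorted(range(len(doc_vectors)), key=lambda j: scores[j], reverse=True)

# Hypothetical document vectors over four terms; the query uses only the third term.
doc_vectors = [[4, 0, 1, 0], [0, 1, 1, 1], [0, 0, 3, 0]]
query = [0, 0, 1, 0]
ranking = rank_documents(query, doc_vectors)
```

The third document, which uses only the query term, ranks first; the first document ranks last because its vector is dominated by a term that the query does not contain.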

[0008] The advantages of the vector space method are that it provides a simple and uniform representation of
documents and queries, can accommodate many variations appropriate to different document collections, and has been
shown to perform relatively well in information retrieval applications. In addition, representing documents as vectors
could be useful for all other text mining operations. However, the performance of the basic vector space method is
severely limited by the size of $D$. In actual document collections, both the number of documents and the number of
terms are typically quite large, resulting in a large $D$, and making the necessary distance calculations prohibitively
slow. It is possible to alleviate this problem by preselecting a subset of all possible terms to use in the matrix, but
this can degrade information retrieval performance and limit text mining capability. Finally, while the traditional
vector space method provides a way of assessing the similarities between pieces of text, it alone does not provide a
good way to visualize these relationships or summarize documents.

[0009] As described by the '888 application, an improved vector space method has been developed that allows the
user to efficiently perform a variety of text mining operations including information retrieval, term and document
visualization, term and document clustering, term and document classification, summarization of individual documents
and groups of documents, and document cross-referencing. In this technique, the document collection is represented
using a subspace transformation based on the distribution of the occurrence of terms in the documents of the document
collection. In particular, a term-by-document frequency matrix $D$ is initially constructed that catalogs the
frequencies of the various terms for each of the documents. The term-by-document matrix can then be preprocessed to
define a working matrix $A$ by normalizing the columns of the term-by-document matrix $D$ to have a unit sum,
stabilizing the variance of the term frequencies via a nonlinear function and then centering the term frequencies with
respect to the mean vector of the columns. This preprocessing is denoted as $A = f(D) - c e^T$, in which $c$ is the
mean of the columns of $f(D)$ and $e$ is a $d$-vector whose components are all 1, so that the average of the columns
of $A$ is now 0. Each $(i,j)$-th entry in $A$ is therefore a score indicating the relative occurrence of the $i$-th
term in the $j$-th document. Traditionally, $f$ is defined as a two-sided weighting function, i.e.,

$$f(D) = (W_t D)\, W_d$$

wherein $W_t$ and $W_d$ are two diagonal scaling matrices for weighting terms and documents, respectively, as known to
those skilled in the art.
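
The preprocessing steps just described (unit column sums, a variance-stabilizing nonlinear function, and centering on the mean of the columns) can be sketched as below. The arcsine of the square root is used as the nonlinear function because the description of indexing names it as one example; treating it as the choice here, and the function name `preprocess`, are assumptions of this sketch.

```python
import math

def preprocess(D):
    """Return (A, c) with A = f(D) - c e^T, for D given as a list of term rows."""
    t, d = len(D), len(D[0])
    col_sums = [sum(D[i][j] for i in range(t)) for j in range(d)]
    # f(D): scale each column to unit sum, then apply arcsine(sqrt(.)).
    F = [[math.asin(math.sqrt(D[i][j] / col_sums[j])) if col_sums[j] else 0.0
          for j in range(d)] for i in range(t)]
    # c is the mean of the columns of f(D), i.e. one average per term row.
    c = [sum(row) / d for row in F]
    A = [[F[i][j] - c[i] for j in range(d)] for i in range(t)]
    return A, c

# Hypothetical 4-term by 2-document frequency matrix.
D = [[4, 0], [0, 1], [1, 1], [0, 1]]
A, c = preprocess(D)
```

After centering, every row of `A` averages to zero, and an entry that was 0 in `D` (such as the first term in the second document) becomes a negative value rather than remaining zero.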

[0010] To capture some of the semantics latent in the documents, i.e., to capture similarity of content despite
variations in word usage such as the use of synonyms, the working matrix $A$ is orthogonally decomposed to obtain a
rank-$k$ matrix $A_k$ that approximates $A$. In this regard, the orthogonal decomposition of the working matrix $A$ can
be performed with a number of decompositional techniques, such as a two-sided orthogonal decomposition.

[0011] By way of example, one typical two-sided orthogonal decomposition is a truncated URV (TURV) decomposition.
For a given dimension $k$, the TURV computes bases of subspaces with high information content (matrices $U_k$ and
$V_k$ with orthonormal columns) satisfying the equation:

$$A V_k = U_k R_k$$

wherein $R_k$ is a triangular matrix of order $k$. Then an approximate term-document matrix $A_k$ is defined as:

$$A_k = U_k R_k V_k^T$$
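
As a rough illustration of a rank-k two-sided orthogonal decomposition, the sketch below uses a truncated singular value decomposition from NumPy in place of the TURV: the SVD is a special case in which the middle factor is diagonal (and hence triangular). The substitution, the function name `rank_k_factors`, and the sample matrix are assumptions of this example, not the decomposition routine of the application.

```python
import numpy as np

def rank_k_factors(A, k):
    """Return (U_k, R_k, V_k) with orthonormal columns in U_k and V_k.

    Swapped-in technique: truncated SVD, so R_k is diagonal rather than
    a general triangular matrix as in a TURV decomposition.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]
    R_k = np.diag(s[:k])
    V_k = Vt[:k, :].T
    return U_k, R_k, V_k

# Hypothetical 4-term by 3-document working matrix.
A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.5,  0.5, -1.0],
              [-1.0, -1.0,  2.0]])
U_k, R_k, V_k = rank_k_factors(A, k=2)
A_k = U_k @ R_k @ V_k.T  # rank-k approximation of A
```

With these factors the relation between the working matrix, the two orthonormal bases and the order-k middle factor holds to rounding error, and `A_k` has the same shape as `A` with rank at most k.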



[0012] For the approximation $A_k$, as well as for $A$, each row corresponds to a term and each column corresponds to
a document. The $(i,j)$-th entry of $A_k$ therefore provides a relative occurrence of the $i$-th term in the $j$-th
document, but this






EP 1 199 647 A2 



relative occurrence has now been filtered by the approximation which captures semantics latent in the documents.
More specifically, the factor $U_k$ captures variations in vocabulary, while the factor $V_k^T$ brings out latent
structure in the document collection.

[0013] Following the orthogonal decomposition designed to capture some of the semantics latent in the documents,
the matrix $A_k$ can be searched to identify the documents that are most relevant to a particular query. In traditional
vector space as well as latent semantic indexing approaches, the query is treated as a pseudo-document and may be
represented as a vector $q$ of length $t$. Each component of the query vector $q$ records the occurrence of the
corresponding term in the query. While the query can be much like another document and have numerous terms, the query
oftentimes contains just a few terms, called keywords. Regardless of its size, the query is then compared to the
term-document matrix $A_k$ in order to identify occurrences of the terms included within the query following the
capture of some of the semantics latent in the documents.

[0014] In this comparison process, each of the $d$ documents (each column of $A_k$) is compared to the given query, or
rather its projection into the $k$-dimensional subspace, and a score is assigned based on this comparison. According to
one conventional technique, a score vector is calculated as follows:

$$s = \langle P_k q, A_k \rangle$$

wherein $\langle \cdot, \cdot \rangle$ is a measurement function applied to $P_k q$ and each column of $A_k$, and
wherein $P_k$ is the projection matrix for the $k$-dimensional subspace $R(U_k)$ and is defined as $U_k U_k^T$.
Traditionally, $\langle \cdot, \cdot \rangle$ could be the inner product, the cosine, or the Euclidean distance of the
vectors. The documents having the best scores can then be returned as the documents most relevant to the particular
query. It can be shown that for the inner product and Euclidean distance, two traditional choices for
$\langle \cdot, \cdot \rangle$, the projection $P_k$ will not alter the sorting result. For example, since
$P_k^T = P_k$ and $P_k A_k = A_k$, the score resulting from the inner product is not changed if $P_k$ is removed from
the determination of the score vector. Therefore, it is more common to define the score vector as:

$$s = \langle q, A_k \rangle$$
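
The statement that the projection does not change inner-product scores can be written out as a short derivation. It assumes the factorization $A_k = U_k R_k V_k^T$ with orthonormal columns in $U_k$ and the projector $P_k = U_k U_k^T$; the notation $s^T$ for the row vector of document scores is introduced here only for illustration.

```latex
\begin{align*}
P_k A_k &= U_k (U_k^T U_k) R_k V_k^T = U_k R_k V_k^T = A_k, \\
s^T &= (P_k q)^T A_k = q^T P_k^T A_k = q^T (P_k A_k) = q^T A_k .
\end{align*}
```

The first line uses $U_k^T U_k = I$, and the second uses the symmetry $P_k^T = P_k$, so each document receives the same inner-product score whether or not the query is first projected.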
[0015] The components of the score vector determine the relative performance of the documents against the query.
Selecting which documents to return to a user can be accomplished in a variety of ways, typically by returning the
best scoring documents. The best scoring documents could be identified, for example, by applying a threshold to the
individual scores, by taking a fixed number in ranked order, or by statistical or clustering techniques applied to the
vectors of the scores.
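
Two of the selection strategies mentioned above, a score threshold and a fixed number in ranked order, can be sketched as follows. The sketch assumes that a larger score is better (as with an inner product or cosine; a Euclidean distance would need the opposite ordering), and the function names and scores are hypothetical.

```python
def select_by_threshold(scores, threshold):
    """Indices of documents scoring at least `threshold`, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [j for j, s in ranked if s >= threshold]

def select_top_n(scores, n):
    """Indices of the n best-scoring documents, best first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:n]

# Hypothetical score vector for four documents.
scores = [0.12, 0.87, 0.45, 0.91]
above = select_by_threshold(scores, 0.4)
top_two = select_top_n(scores, 2)
```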

[0016] Treating each query as a pseudo-document is certainly a viable technique and provides valuable information
in many instances, particularly in instances in which the query is an actual document and the user wishes to identify
other documents like it. By treating each query as a pseudo-document, however, the above-described scoring technique
may suffer from several difficulties in certain circumstances. In this regard, a query vector having just a few terms
contains only a few non-zero components. As such, the measurement function $\langle \cdot, \cdot \rangle$ may be
corrupted by entries in the term-document matrix $A_k$ that are not of interest or are irrelevant with respect to the
query, i.e., entries in the rows of $A_k$ that correspond to terms of the query that have a zero component. In this
regard, terms of a query that have a zero component should be treated as being irrelevant for purposes of the
comparison; that is, documents having the terms of the query that have a non-zero component should receive a relatively
good score regardless of whether or not the documents include the terms that have zero components in the query.
However, by treating queries as pseudo-documents, the absence of certain terms is interpreted to mean, not that it is
irrelevant as to whether the terms are present or not, but that the terms should occur at a below average frequency,
since both the original set of documents and the query have been centered with respect to the mean vector of the
respective columns, thereby transforming entries that were originally zero to some other fractional value.

[0017] Moreover, the scores that are determined as described above may also be misleading if a document makes
disproportionate use of the various terms that comprise a query. A typical query contains few terms and each typically
occurs only once, and when this is treated as a pseudo-document the documents containing these terms in roughly
equal proportions will be more likely to be returned than documents that contain all of the terms, possibly in
substantial numbers but in unequal proportion. Finally, documents that include one or more high frequency terms may
receive a misleadingly good score even though those documents include very few, if any, of the other terms of the query,
which are of equal importance in determining the relevance of the documents as the high frequency terms. It
would therefore be desirable to weight the various terms included within the search query. As such, the preprocessing
function $f$ typically includes a term weighting factor $W_t$ to reduce the impact of high-frequency terms and the
disproportionate use of the terms. This type of term weighting is a type of global weighting since it is calculated
based on the entire document set. Since traditional term weighting is calculated based on the entire document set, the
addition of new documents or the removal of old documents from the document collection requires the term weighting
factor to again be determined for all of the documents, including those that remain from the prior collection. As will
be apparent, this recomputation of the term weighting factor can be relatively time consuming and processing intensive
in situations [...] documents. Additionally, by globally applying a term weighting factor, the relative importance of
certain terms in a document is changed such that the resulting subspace representation $A_k$ will not be suitable [...]

[...] to search a collection of documents to retrieve information or classify new [...] weight each term without
requiring extensive recomputation of the weighting factors as the document collection is updated.

SUMMARY OF THE INVENTION

[...] of the query are weighted following the projection into the lower dimensional subspace. Thus, a plurality of
documents can be scored or a document can be classified in a reliable fashion since high-frequency terms and the [...]
results, and terms that are irrelevant with respect to the query are not considered. In addition, updating of the text
data collection is simplified since the weights are [...]

[...] representing the frequency of occurrence of a term in a respective document. According to [...] document indexing.
A query is received that typically identifies at least one term [...] depending upon its treatment. As such, a
determination is initially made to treat the query as [...] depending at least partially upon the number of terms [...].
If the query is to be treated as a set of terms, the query is processed and scored as described above. Alternatively,
[...] as a pseudo-document, a representation of at least a portion of the [...] matrix [...]

[...] collection of terms. A representation of the document to be classified is received that consists of [...] of the
subspace representation $A_k$. Depending upon the scores of the relationship of the document to each predefined
class, the document may be classified into none, one or more of the plurality of the predefined classes.

[0022] According to either aspect of the present invention, the weighting of at least those portions of the subspace
representation $A_k$ relating to at least one term can be performed in a variety of fashions. In this regard, the
subspace representation $A_k$ includes a plurality of rows corresponding to respective terms. In one embodiment, each
term is weighted by determining an inverse infinity norm of the term, i.e., the inverse of the maximum of the absolute
values of the entries in the row of the subspace representation $A_k$ corresponding to the term. In another embodiment,
each term is weighted by determining an inverse one-norm of the term, i.e., the inverse of the sum of the absolute
values of the entries of the row of the subspace representation $A_k$ corresponding to the term. In yet another
embodiment, each term is weighted by determining an inverse 2-norm of the term, i.e., the inverse of the square root of
the sum of the squares of the entries in the row of the subspace representation $A_k$ corresponding to the term.
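
The three row weightings named above can be computed directly from a single row of the subspace representation. The sketch below is illustrative only; the function name, the dictionary keys, and the sample row are assumptions, and a row of all zeros is rejected because none of the inverse norms is defined for it.

```python
import math

def inverse_norm_weights(row):
    """Return the inverse infinity-, one- and 2-norm weights for one term row."""
    abs_vals = [abs(x) for x in row]
    if not any(abs_vals):
        raise ValueError("row is all zeros; inverse norms are undefined")
    return {
        "inf": 1.0 / max(abs_vals),                       # inverse of max |entry|
        "one": 1.0 / sum(abs_vals),                       # inverse of sum of |entry|
        "two": 1.0 / math.sqrt(sum(x * x for x in row)),  # inverse of root sum of squares
    }

# Hypothetical row of the subspace representation for one term.
row = [3.0, -4.0, 0.0]
weights = inverse_norm_weights(row)
```

Because each weight depends only on the term's own row after projection, adding or removing documents changes the weights only through the updated subspace representation, not through a separately maintained global term weighting.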

[0023] Accordingly, the methods, apparatus and computer program products of the present invention provide
improved techniques for retrieving information from a text data collection and for classifying a document into none,
one or more of a plurality of predefined classes. By weighting the term(s) of the query when treated as a set of terms
or the term(s) of the document to be classified following the projection into the lower dimensional subspace, a
plurality of documents can be scored or a new document can be classified in a reliable fashion since high-frequency
terms and the disproportionate occurrence of terms in documents will not unnecessarily skew the results and since terms
that are irrelevant with respect to the query are not considered. In addition, updating of the text data collection is
simplified since the weights are determined following the projection of the original matrix into the lower dimensional
subspace, thereby avoiding the difficulty of having to recompute each row-scaling factor in every instance in which a
new document is added or an existing document is removed from the text data collection.

BRIEF DESCRIPTION OF THE DRAWINGS 

[0024] Having thus described the invention in general terms, reference will now be made to the accompanying
drawings wherein:

Figure 1 is a flow diagram illustrating the overall logic of a text mining program formed in accordance with the 
present invention; 

Figure 2 is a flow diagram illustrating logic for generating a term list; 

Figure 3A is a flow diagram illustrating logic for performing indexing that provides a representation of the documents 
for text mining operations; 

Figure 3B is a flow diagram illustrating logic for performing classifier training; 
Figure 4 is a flow diagram illustrating logic for performing update indexing; 

Figure 5 is a flow diagram for determining a new subspace representation by updating an existing subspace with 
new documents and terms; 

Figure 6 is a flow diagram illustrating the logic of performing information retrieval operations;

Figure 7 is a flow diagram illustrating the overall logic associated with a document classification operation in
accordance with the present invention;

Figure 8 is a more specific flow diagram illustrating the logic of performing a document classification operation;

Figures 9A and 9B graphically illustrate the entries in a subspace representation $A_k$ of an exemplary collection of
documents for the terms engines and idle, respectively;

Figures 10A and 10B are graphical illustrations of the sorting of the documents utilizing unweighted and weighted
techniques, respectively;

Figure 11 is a block diagram of a general purpose computer system suitable for implementing the present invention; and

Figures 12-23 are screen representations of a preferred embodiment according to Figure 11.

DETAILED DESCRIPTION OF THE INVENTION 

[0025] The present invention now will be described more fully hereinafter with reference to the accompanying
drawings in which preferred embodiments of the invention are shown. This invention may, however, be embodied in many
different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments
are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to
those skilled in the art. Like numbers refer to like elements throughout.

[0026] The methods, apparatus and computer program products of the present invention perform text mining
operations and, more particularly, information retrieval and document classification operations. In performing these
operations, the methods, apparatus and computer program products of the present invention utilize a multidimensional
subspace to represent semantic relationships that exist in a set of documents in order to obtain more meaningful
results. Accordingly, the methods, apparatus and computer program products of the present invention are capable of
processing [...] in a reasonably fast processing time without requiring prior knowledge of the data.

[0027] Figures [...] are flow diagrams illustrating the logic of performing information retrieval operations on a text
data collection according to one aspect of the present invention. As described hereinafter, the logic of performing
document classification is similar in many respects to the logic of performing information retrieval and is [...].
[...] the method, apparatus and computer program product of the present invention will be initially described in
conjunction with information retrieval operations. As explained hereinafter, the logic associated with both
information retrieval and document classification treats queries as either a pseudo-document, as described above in
conjunction with conventional techniques, or in a unique manner as a set of terms or keywords. By permitting queries
to be treated as either pseudo-documents or sets of keywords, the information retrieval and document classification
[...] in a manner that will be most efficient and effective for the particular query.

[0028] Figure 1 is a flow diagram illustrating the overall logic of the present invention relating to information
retrieval operations. The logic moves from a start block to decision block 100 where a test is made to determine if the
document collection is new. If so, the logic moves to block 104 where a term list is generated from the document
collection. Generating a term list from the document collection is illustrated in detail in Figure 2, and is described
later. Next, in block 106 initial indexing is performed, as illustrated in detail in Figure 3A and described later. In
basic terms, indexing involves the creation of the subspace representation $A_k$ from the document collection. After
initial indexing is performed, or if the document collection is not new, the logic moves to decision block 108 where a
test is made to determine if the program should exit. If so, the logic of the present invention ends. If the program
should not exit, the logic moves from decision block 108 to decision block 110 where a test is made to determine if
documents have been added to the collection. If so, the logic moves to decision block 112 where a test is made to
determine if re-indexing should be performed. Update indexing modifies the subspace to approximate the effects of the
new documents. Over time, the approximation of update indexing will gradually lose accuracy, and re-indexing should be
performed to re-establish the latent semantic structure of the modified document collection. Preferably, the
determination of when to perform re-indexing is made by a user. Preferably, the user has been provided with data that
allows him or her to estimate the growing error in approximation. The user can then perform re-indexing to renew the
subspace at a convenient time. If it is determined that re-indexing should be performed, the logic moves from
decision block 112 to block 106 where indexing is performed as described later with reference to Figure 3A. The
re-indexing logic is the same as the initial indexing logic. If re-indexing should not be performed, the logic moves to
block 114 where update indexing is performed. The logic of performing update indexing is illustrated in detail in
Figure 4 and described later. After performing re-indexing 106 or update indexing 114, the logic moves to decision
block 108 to determine if the program should exit.

[0029] If in decision block 110 it is determined that there were not any documents added to the document collection,
the logic moves to decision block 116 where a test is made to determine if an information retrieval operation should
be performed. Preferably, this determination is based on a user request to perform an information retrieval operation.
If so, the logic moves to block 118 for performance of a text mining operation, namely, an information retrieval
operation as depicted in Figure 6. After the performance of the information retrieval operation 118, the logic moves
to decision block 108 to determine if the program should exit. If so, the logic ends. If not, the logic moves to
decision block 110, and the logic of blocks 108 through 118 is repeated until it is time to exit. It will be
appreciated by those skilled in the art that the logic performed in Figure 1 can be performed in a different order.

[0030] Figure 2 illustrates in detail the logic of generating a term list. The logic of Figure 2 moves from a start
block to block 130 where terms are tokenized according to a tokenizing policy (e.g., sequences of letters, letters and
numbers, or letters, numbers and certain punctuation like hyphens or slashes, i.e., whatever is needed to capture the
important terms in the particular domain or application). Next, in block 132, stopwords are removed according to a
stopwords policy. Stopwords are either terms that do not contribute significantly to the overall topic of the
documents, such as conjunctions and prepositions, or terms that are frequently used throughout the document, and thus
do not serve to topically distinguish one document from another. The optimal set of stopwords (i.e., the stopwords
policy) for a document collection is typically specific to that document collection. Low frequency words, i.e., words
occurring relatively few times in the document collection, are removed according to a low frequency words policy. See
block 134. The low frequency words policy is based on the document collection. This policy may be not to remove low
frequency words, thereby making this an optional step. As many as half of the terms in a typical data collection occur
less than five times. Eliminating these low frequency terms from $D$ is an optional step that can greatly increase
computational speeds with a minor loss of information in the subspace. The logic then moves to block 138 where term
normalization is performed according to a term normalization policy. The term normalization policy is based on the
document collection. This policy may be not to perform any term normalization, thereby making this an optional step.
Term normalization may include: acronym expansion (e.g., "COTS" is the same as "commercial off-the-shelf"),
abbreviation expansion (e.g., "ref." is the same as "reference" in some document collections), and other term
normalization. Other term normalization is specific to the document collection; for example, in a document collection
pertaining to different commercial aircraft models, it might be desirable to group model numbers together, e.g., "747"
and "737." The term normalization can include any combination of term normalizations, including but not limited to
those previously listed. Some of the term normalizations may be performed more than one time. The term normalization
policy defines the term normalizations and their order of performance for a given document collection. In block 142,
stemming is performed according to a stemming policy. The stemming policy is based on the document collection. This
policy may be not to perform stemming, thereby making this an optional step. Stemming eliminates conjugate forms of a
word, e.g., "es," "ed," and "ing," and keeps only the root word. Care needs to be taken when performing stemming; for
example, it would not be desirable to change "graphics" to "graph" or "Boeing" to "Boe." Finally, in block 144, the term
list is stored. When a document collection changes, and update indexing or re-indexing is performed, the same policies
originally used to generate the term list, i.e., the same term tokenizing policy 130, the same stopwords policy 132,
the same low frequency words policy 134, the same term normalization policy 138, and the same stemming policy 142, are
used to update the term list. The logic of Figure 2 then ends and processing is returned to Figure 1.
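
The term list pipeline of Figure 2 can be sketched in a few lines. Everything concrete below is a hypothetical policy chosen for illustration: the stopword set, the normalization table, the suffix list, the protected words and the regular expression are assumptions, and the suffix stripping is deliberately crude (it turns "jammed" into "jamm").

```python
import re
from collections import Counter

# Hypothetical policies; a real collection would supply its own.
STOPWORDS = {"the", "a", "and", "of", "in", "is", "for"}
NORMALIZE = {"ref": "reference"}          # abbreviation expansion
SUFFIXES = ("ing", "es", "ed")            # crude stemming rule
PROTECTED = {"boeing", "graphics"}        # words that must not be stemmed

def tokenize(text):
    """Block 130: letters and digits, optionally joined by hyphens or slashes."""
    return re.findall(r"[a-z0-9]+(?:[-/][a-z0-9]+)*", text.lower())

def stem(term):
    """Block 142: strip one listed suffix unless the term is protected."""
    if term in PROTECTED:
        return term
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def build_term_list(documents, min_count=1):
    tokens = [tok for doc in documents for tok in tokenize(doc)]   # block 130
    tokens = [tok for tok in tokens if tok not in STOPWORDS]       # block 132
    counts = Counter(tokens)
    tokens = [tok for tok in tokens if counts[tok] >= min_count]   # block 134
    tokens = [NORMALIZE.get(tok, tok) for tok in tokens]           # block 138
    return sorted({stem(tok) for tok in tokens})                   # blocks 142, 144

docs = ["The landing gear jammed", "Boeing graphics ref. landed"]
term_list = build_term_list(docs)
```

Setting `min_count` to a value such as 5 would apply the optional low frequency words step; the default of 1 leaves every term in place, which corresponds to a policy of not removing low frequency words.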

[0031] Figure 3A is a flow diagram illustrating the logic of performing indexing. Indexing is performed on the initial
document collection, as well as when it is determined that re-indexing should occur (see Figure 1). The logic of Figure
3A moves from a start block to block 150 where a term-by-document or term frequency matrix is computed. The
term-by-document matrix $D$ is defined from a set of $d$ documents that have been derived from a free or
semi-structured text collection. Across this document collection, statistics are accumulated on the frequency of
occurrence of each term. Each entry $D_{i,j}$ is the raw frequency of the term in the given document, i.e., $D_{i,j}$
is the number of times term $t_i$ occurs in document $d_j$. $D$ is typically quite sparse. For example, it is common to
find term-by-document matrices with over 98% of the entries being zero.
[0032] After the computation of the term-by-document matrix, the logic moves to block 152 where statistical
transformations of matrix entries are performed according to a statistical transformation policy. The statistical transformation
policy may be not to perform any statistical transformations, thereby making this an optional step. Better results may
be achieved through statistical transformation. Exemplary transformations include: (1) adjusting a term-by-document
entry by the sum of the term frequencies of the document, thus obtaining a relative frequency of occurrence; (2) applying
a transformation to the data (e.g., taking the arcsine of the square root of the relative frequencies) to stabilize the
variance of the sampling frequencies, thereby making words with radically different frequencies more comparable; and
(3) centering the data around the origin by subtracting the row average from each term-by-document entry. Obtaining a
relative frequency and stabilizing the variance of the sampling frequencies make the term frequencies more comparable
to each other from one document to the other, while centering the data makes the interpretation of the data
statistically more meaningful. Obtaining a relative frequency and stabilizing the variance of the sampling frequencies
themselves do not change the sparsity of the matrix. However, centering the data does destroy the sparsity of the matrix and
is sometimes avoided for computational reasons.
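
The three exemplary transformations can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the '888 application: the function name `transform_term_document`, the `center` flag, and the handling of empty documents are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def transform_term_document(D, center=True):
    """Apply the three exemplary transformations to a term-by-document matrix.

    D has one row per term and one column per document (raw counts).
    """
    D = np.asarray(D, dtype=float)
    # (1) Relative frequency: divide each column by the document's total count.
    col_sums = D.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0  # assumption: an empty document stays all zeros
    rel = D / col_sums
    # (2) Variance stabilization: arcsine of the square root.
    stab = np.arcsin(np.sqrt(rel))
    if not center:
        return stab  # zero entries stay zero, so sparsity is preserved
    # (3) Centering: subtract each term's (row) average; this fills in the zeros.
    return stab - stab.mean(axis=1, keepdims=True)

# Toy matrix: 3 terms (rows) by 3 documents (columns).
D = np.array([[2, 0, 1],
              [0, 3, 0],
              [2, 1, 0]])
A = transform_term_document(D)
S = transform_term_document(D, center=False)
```

Skipping step (3) via `center=False` keeps every zero entry at zero, which mirrors the remark that centering is sometimes avoided for computational reasons.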

[0033] In one advantageous embodiment, the initial term-by-document matrix $D$, having a plurality of columns, one
of which represents each document, and a plurality of rows, one of which represents each term, is preprocessed to
form a working matrix $A$. In this embodiment, the preprocessing includes normalizing the columns of matrix $D$ to have
unit sum, stabilizing the variance of term frequencies via a non-linear function, and then centering with respect to the
mean vector of the columns. The preprocessing can be mathematically represented by $A = f(D) - c e^T$, wherein $c$ is the
mean vector and $e$ is a $d$-vector whose components are all 1, so that the average of the columns of $A$ is now zero.
As such, each $(i,j)$-th entry in $A$ is a score indicating the relative occurrence of the $i$-th term in the $j$-th document.

[0034] The weighting function $f$ preferably includes a column-scaling factor $W_d$ for weighting the matrix on a
document-by-document basis. However, the weighting function $f$ preferably does not include a row-scaling factor $W_t$ so as
to facilitate the updating of the working matrix as documents are added or removed from the text document collection,
since the row scaling factors do not have to be determined across all of the documents. In one embodiment, for example,
the weighting function $f$ is defined as:

$$ f(D) = \sin^{-1}\left(\sqrt{D W_d}\right) $$



[0035] Following the statistical transformation of the matrix entries, the matrix $A$ is projected into a lower dimensional
subspace. For example, the working matrix $A$ can be projected into a $k$ dimensional subspace, thereby defining the
subspace representation $A_k$. While the working matrix $A$ can be projected into the subspace according to a variety of
techniques including a variety of orthogonal decompositions, the projection of $A$ into the subspace is typically performed
via a two-sided orthogonal matrix decomposition, such as a truncated URV (TURV) decomposition as described by
the '888 application, in order to expose the latent semantic structure of the document collection. The TURV
decomposition provides a means of projecting the data into a much lower dimensional subspace that captures the essential
patterns of relatedness among the documents. Statistically, the effect of the TURV is to combine the original large set
of variables into a smaller set of more semantically significant features. The coordinates of the projected data in the
reduced number of dimensions can be used to characterize the documents, and therefore represent the effect of
thousands or tens of thousands of terms in a few hundred or more significant features. As a result of the properties of
the TURV, the resultant subspace will capture the latent semantic structure of the original matrix $A$, removing the
differences that accrue from the user's variability in word choice to describe the same ideas, thus enhancing the ability
to perceive the semantic patterns in the data. Following the projection of the working matrix $A$ into the $k$ dimensional
subspace, the logic of Figure 3A returns processing functionality to Figure 1. As will be described hereinafter, the entire
subspace representation $A_k$ need not always be determined. Instead, only those portions, i.e., those rows, of the
subspace representation $A_k$ that correspond to the terms included within the query must be determined, thereby conserving
processing resources and reducing processing time.
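
The idea of computing only the needed rows of $A_k$ can be sketched as follows. Note the substitution: a truncated singular value decomposition is used here as a stand-in for the TURV decomposition of the '888 application, and the helper names `fit_subspace` and `subspace_rows` are hypothetical.

```python
import numpy as np

def fit_subspace(A, k):
    """Rank-k factors of the working matrix A (terms x documents).

    Assumption: a truncated SVD stands in for the TURV decomposition.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def subspace_rows(U_k, s_k, Vt_k, term_indices):
    """Rows of A_k = U_k diag(s_k) Vt_k for the requested terms only."""
    # Scale the selected rows of U_k by the singular values, then expand
    # them over the documents; the full dense A_k is never formed.
    return (U_k[term_indices, :] * s_k) @ Vt_k

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))           # 6 terms, 5 documents (toy data)
U_k, s_k, Vt_k = fit_subspace(A, k=2)
rows = subspace_rows(U_k, s_k, Vt_k, [1, 4])
```

For a query with a handful of terms, the cost of `subspace_rows` scales with the number of query terms rather than with the size of the vocabulary.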

[0036] As new documents are added to the document collection, update indexing is performed, as illustrated in
Figure 4. The logic of Figure 4 moves from a start block to block 160 where a term-by-document matrix for the new
documents is computed. Next, in block 162 a statistical transformation of the matrix entries is performed according to
the statistical transformation policy (see block 152, Figure 3A). Still referring to Figure 4, the logic then moves to a block
where a new subspace representation is determined by updating the existing subspace with new documents and
terms, as illustrated in detail in Figure 5 and described next. The logic of Figure 4 then ends and processing is returned
to Figure 1.

[0037] The logic of Figure 5 determines a new subspace representation by updating the existing subspace with new
documents and terms by initially moving from a start block to block 170 where new documents are projected on the
original subspace and the residual is computed. Next, in block 172, the existing term subspace $U_k$ is augmented with
the normalized residual, which is orthogonal to the original term subspace, and the document subspace $V_k$ is expanded
by adding a small identity matrix accordingly. See the '888 application for a more detailed description of the term
subspace $U_k$ and the document subspace $V_k$. The logic then moves to block 174 where the $k$ most significant features
in the subspace are re-identified, again, for example, by rank-revealing techniques. The logic of Figure 5 then ends.
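
The projection and augmentation of blocks 170 and 172 can be sketched for a single new document as follows. This is only an illustration: the function name `augment_term_subspace` and the tolerance are hypothetical, the basis is assumed to have orthonormal columns, and the re-identification of the $k$ most significant features in block 174 is not shown.

```python
import numpy as np

def augment_term_subspace(U_k, new_doc, tol=1e-12):
    """Project a new document onto span(U_k) and append the normalized residual.

    U_k is assumed to have orthonormal columns (terms x k).
    Returns the (possibly augmented) basis, the coordinates of the
    document in the original subspace, and the residual norm.
    """
    coords = U_k.T @ new_doc              # projection on the original subspace
    residual = new_doc - U_k @ coords     # part not explained by the subspace
    norm = np.linalg.norm(residual)
    if norm <= tol:
        return U_k, coords, norm          # document already lies in the subspace
    U_aug = np.column_stack([U_k, residual / norm])
    return U_aug, coords, norm

rng = np.random.default_rng(1)
U_k, _ = np.linalg.qr(rng.standard_normal((5, 2)))   # toy orthonormal basis
new_doc = rng.standard_normal(5)
U_aug, coords, norm = augment_term_subspace(U_k, new_doc)
```

Because the residual is orthogonal to the original term subspace, the augmented basis remains orthonormal.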

[0038] Figure 6 illustrates the logic of information retrieval, which commences with the receipt of a query as shown
in block 200. Typically, the query includes at least one term, although in some instances the query may be devoid of
terms since, for example, the query may have been composed of one or more words that do not serve as terms. As
depicted in block 202, an initial decision is made as to whether to treat the query as a pseudo-document or as a set
of terms and then to process the query differently depending upon its treatment. Thus, the method, apparatus and
computer program product of this aspect of the present invention advantageously supports different types of processing
of the query depending upon the nature of the query itself, thereby providing more efficient and effective analysis of
the queries than existing latent semantic indexing methods which always treat a query as a document that is projected
into the same subspace. Typically, the decision as to whether to treat a query as a pseudo-document or as a set of
terms depends upon the number of terms. As such, queries having a large number of terms are processed in a
conventional manner as a pseudo-document, while queries having fewer numbers of terms (generally in relation to [...])
are treated as a set of terms as described below. The precise number of terms a query must have to qualify to be
treated as a pseudo-document can vary depending upon the application and is generally [...].

[0039] In instances in which the query is to be treated as a set of terms, the method of this aspect of the present
invention looks to the rows of the subspace representation $A_k$ to exploit the latent semantics captured in the
subspace representation. Since only the rows of the subspace representation $A_k$ that correspond to the terms are
analyzed, only these rows need be computed, thereby conserving processing time and resources. For example, a
given query may contain two terms, term $i$ and term $j$, having equal importance. Thus, the query vector $q$ can be defined
as $q = e_i + e_j$, wherein $e_i$ and $e_j$ are the $i$-th and $j$-th unit vectors. In order to identify those documents in which
both terms exist semantically, the conventional technique is to form a score vector by calculating the inner product

$$ s = q^T A_k = a_i + a_j $$

wherein $a_i$ and $a_j$ are the $i$-th and $j$-th rows of $A_k$, respectively. As known to those skilled in the art and as mentioned
above, the score vector can be determined in other manners. Regardless of the manner in which the score vector
is determined, the components of $s$ are the scores of the respective documents. Unfortunately, this scoring technique
is flawed since the various terms represented by $A_k$ are not weighted on a term-by-term basis. Accordingly, the high-
frequency terms swamp the lower-frequency terms and may disadvantageously dominate the results. In a set of service
bulletins at The Boeing Company consisting of 1,178 documents indexed by 3,514 terms, the term "engine" was present
at a much greater frequency than the term "idle". As described below in more detail, in instances in which the presence
of the term "idle" was as important or possibly more important than the presence of the term "engine", the resulting
scores could be misleading since the higher frequencies associated with the term "engine" would generally dominate
the resulting scores. See, for example, Figures 9A and 9B, which graphically depict the entries associated with the
terms "engine" and "idle", respectively, for a plurality of the 1,178 documents.

[0040] As described above, globally weighting the terms prior to the projection of the working matrix $A$ into the lower
dimension subspace would greatly increase the difficulty associated with updating the document collection and may
render the subspace representation unsuitable for applications such as the assignment of topic words. As such, the
method, apparatus and computer program product of one aspect of the present invention weights only the respective
rows of the subspace representation $A_k$ that relate to the terms identified by the query. See block 204. The relative
importance of the terms of the query can be defined and the dominance of the high-frequency terms can therefore be
abated if desired. By weighting the rows of the subspace representation $A_k$ as opposed to the working matrix $A$,
documents can be readily added and removed from the text document collection without re-indexing the entire
document collection. In addition, the row weighting factors need not be defined for each row of the subspace representation
$A_k$, but only for those rows that relate to the terms defined by the query. By appropriately weighting the rows with a term-
weighting matrix $W_t$, the scoring formula is now represented as:

$$ s = q^T W_t A_k = w_i a_i + w_j a_j $$

wherein $W_t$ is a matrix with row weighting scalars in its diagonal. As indicated above, however, other scoring
techniques can be utilized in conjunction with this aspect of the present invention with the foregoing formula presented
for purposes of example and not of limitation.

[0041] Various techniques can be utilized to determine the relative weights of the rows of the subspace representation
$A_k$. For example, the weights can be calculated as the inverse infinity norm that is defined as $w_i = 1/\|a_i\|_\infty$, wherein
$\|a_i\|_\infty$ is the maximum of the absolute values of the elements of $a_i$. Alternatively, the weights can be calculated as the
inverse 1-norm which is defined as $w_i = 1/\|a_i\|_1$, wherein $\|a_i\|_1$ is the sum of the absolute values of the elements of $a_i$.
Still further, the weight can be calculated as the inverse 2-norm which is defined as $w_i = 1/\|a_i\|_2$, wherein $\|a_i\|_2$ is the
square root of the sum of the squares of the elements of $a_i$.
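
The three row-weighting choices can be sketched as follows. The function name `inverse_norm_weight` and the convention of returning a zero weight for an all-zero row are assumptions made for illustration only.

```python
import numpy as np

def inverse_norm_weight(row, kind="2"):
    """Row weight as the inverse of the infinity-, 1- or 2-norm of the row."""
    row = np.asarray(row, dtype=float)
    if kind == "inf":
        norm = np.max(np.abs(row))        # largest absolute entry
    elif kind == "1":
        norm = np.sum(np.abs(row))        # sum of absolute entries
    elif kind == "2":
        norm = np.sqrt(np.sum(row ** 2))  # square root of the sum of squares
    else:
        raise ValueError(f"unknown norm kind: {kind!r}")
    # Assumption: an all-zero row gets weight 0 rather than dividing by zero.
    return 0.0 if norm == 0 else 1.0 / norm

a_i = np.array([3.0, -4.0, 0.0])          # toy row of the subspace representation
w_inf = inverse_norm_weight(a_i, "inf")   # 1/4
w_one = inverse_norm_weight(a_i, "1")     # 1/7
w_two = inverse_norm_weight(a_i, "2")     # 1/5
```

Each weight depends only on the row it scales, so no statistic spanning other terms has to be maintained when the collection changes.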

[0042] The foregoing example minimizes the distinction between the steps represented by blocks 204 and 206. In
block 204 the relevant rows of $A_k$ are selected and weighted, the selection being based on the keywords and the
weighting mitigating the effect of high-frequency terms. In block 206 a score vector is generated from these weighted
rows. In the example above the score is produced by adding the two rows; however, a variety of scoring functions
could be used. The entries in the weighted rows for each document could be combined into a score by, for example,
taking the sum of their squares or taking the maximum entry. Regardless of the particular weighting and scoring
methodology, the plurality of documents represented by the subspace representation $A_k$ can be scored with respect to the
query and the documents can then be ranked in terms of score, with the more relevant documents having a better
score, as indicated in block 206. As a result of the weighting of those portions of the subspace representation $A_k$ that
relate to the terms of the query, the resulting scores can be more meaningful and will not be unnecessarily swamped
by high-frequency terms or by the disproportionate use of terms. Additionally, the resulting score will not be adversely
impacted by query terms that are zero since these terms are not considered and are now properly treated as irrelevant.
Those documents having the best score can then be retrieved as being relevant or most relevant with respect to the
terms identified by the query, as indicated by block 208. A geometrical illustration of the importance of row weighting
is depicted in Figures 10A and 10B. In this illustration, the documents are projected onto the 2-D plane spanned by
the $i$-th and $j$-th terms, i.e., $a_i$ ("engines") and $a_j$ ("idle") are the x- and y-coordinates of the documents. The projected
documents are initially represented by points. The scoring method using the inner product can be depicted as a line with
a slope of -1 that moves from the far upper-right corner to the lower-left corner. The sorting result is equivalent to the
order in which the documents are touched by the moving line. In Figure 10A, the 20 documents having the best
unweighted scores are marked, and the scoring line is drawn so as to separate these 20 documents from the rest
of the corpus. Note that only documents containing many occurrences of "engines" are selected. In Figure 10B, term-
weighting scalars have been applied based upon the inverse 2-norm weighting technique and the 20 documents having
the best weighted scores are marked. It is clear that the weighting scalars boost the contribution of $a_j$, "idle", giving it
parity with $a_i$, "engines". These results will be hereinafter examined in more detail.
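
A toy sketch of blocks 204 through 208 is given below: the query rows are optionally weighted, combined into one score per document, and the documents are ranked. The data are invented for illustration (four documents, a dominant "engines" row and a small "idle" row), and the helper names `score_documents` and `rank_documents` are hypothetical.

```python
import numpy as np

def score_documents(rows, weights=None, combine="sum"):
    """Combine the (optionally weighted) query rows into one score per document."""
    rows = np.asarray(rows, dtype=float)
    if weights is None:
        weights = np.ones(rows.shape[0])
    weighted = rows * np.asarray(weights, dtype=float)[:, None]
    if combine == "sum":
        return weighted.sum(axis=0)
    if combine == "sum_of_squares":
        return (weighted ** 2).sum(axis=0)
    if combine == "max":
        return weighted.max(axis=0)
    raise ValueError(f"unknown combine rule: {combine!r}")

def rank_documents(scores, top=10):
    """Indices of the best-scoring documents, best first."""
    return np.argsort(-np.asarray(scores), kind="stable")[:top]

# Invented rows of the subspace representation for two query terms.
a_engines = np.array([9.0, 8.0, 1.0, 0.5])   # high-frequency term
a_idle = np.array([0.0, 0.1, 0.9, 1.0])      # low-frequency term
rows = np.vstack([a_engines, a_idle])

unweighted = rank_documents(score_documents(rows), top=2)
w = 1.0 / np.linalg.norm(rows, axis=1)       # inverse 2-norm row weights
weighted = rank_documents(score_documents(rows, w), top=2)
```

In this toy data the unweighted ranking selects the two documents dominated by "engines", while the inverse 2-norm weighting brings the documents in which "idle" is prominent to the top.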

[0043] By way of example of this aspect of the present invention, a text data collection consisting of 1,178 unique
documents indexed by 3,514 terms was queried. As previously explained, each of the documents was a service bulletin
of The Boeing Company and consisted of two parts, namely, a subject and a body. According to this aspect of the
present invention, the bodies of the documents were preprocessed and indexed and a subspace representation $A_k$ was
obtained. Since $A_k$ is a dense matrix, $A_k$ need not ever be formed explicitly. Instead, the rows of $A_k$ that relate or
otherwise correspond to the terms of a query can be determined or computed as necessary.

[0044] For purposes of comparison, the unweighted scoring method defined above was tested with two queries, namely,
a first query for the term "engines" and a second query for the terms "engines" and "idle". In this particular sample,
the term "engines" appeared with much greater frequency, appearing in 566 documents in contrast to the 76 documents
in which the term "idle" appeared. The results of the unweighted scoring method are depicted in Table 1, in which the
documents having the 10 best scores are listed for each query. In this regard, the documents are listed by the
respective index assigned to each document, i.e., document number 52, document number 245, etc.

TABLE 1

Rank    engines    engines + idle
1       52         245
2       245        52
3       247        238
4       46         247
5       238        57
6       56         40
7       40         229
8       229        46
9       42         42
10      57         221



[0045] Although the results of the two queries have some difference in the order of the documents, the results clearly
indicate that the term "engines" dominates the results, as is depicted in Figures 9A and 9B.

[0046] The same collection of documents was then scored with respect to the query for the terms "engines" and
"idle" after weighting the respective terms utilizing the inverse 2-norm weighting technique. The 10 documents
having the best unweighted and weighted scores are depicted in Table 2 hereinbelow. The 10 documents having the
best scores as a result of the application of the IDF term weighting factor to a subspace representation generated from
the term document matrix by applying traditional 2-sided weighting functions are also provided for comparison purposes.

TABLE 2

Rank    Unweighted score    Weighted score    IDF
1       245                 1024              1022
2       52                  656               240
3       238                 654               1024
4       247                 1023              1023
5       57                  652               652
6       40                  653               652
7       229                 1022              654
8       46                  57                39
9       42                  238               236
10      221                 221               656



[0047] Additionally, a manual screening of the documents was conducted and the 7 documents that are most relevant
to the query consisting of the terms "engines" and "idle" were identified. The subject lines and the indices for each of
these 7 most relevant documents are listed below:

652: ENGINE FUEL AND CONTROL (CF6-80C2 FADEC ENGINES) - FUEL CONTROL SYSTEM - MINIMUM IDLE REVISION
653: ENGINE FUEL AND CONTROL (PW4000 ENGINE) - FUEL CONTROL SYSTEM - MINIMUM IDLE REVISION
654: ENGINE CONTROL (CF6-FADEC ENGINES) - ENGINE IDLE CONTROL SYSTEM - INSPECTION
656: ENGINE CONTROL (CF6-80C2 FADEC ENGINES) - ENGINE IDLE CONTROL SYSTEM - INSPECTION
1022: IGNITION (CF6-80C2 ENGINES) - IGNITION GENERAL - ENGINE IGNITION SYSTEM - MINIMUM IDLE
1023: IGNITION (PW4000 ENGINES) - IGNITION GENERAL - ENGINE IGNITION SYSTEM - MINIMUM IDLE
1024: ENGINE CONTROL (CF6-80C2 FADEC ENGINES) - ENGINE IDLE CONTROL SYSTEM - INSPECTION

[0048] As will be noted, the 10 best scoring documents utilizing the weighted scoring technique not only include each
of the 7 documents, as indicated by bold face type, but have the 7 most relevant documents ranked as the best scoring
documents. This example of utilizing inverse 2-norm row-weighting is graphically depicted in Figure 10B. In contrast,
the unweighted scoring technique did not identify any of the 7 most relevant documents.

[0049] As a further example, a query is based upon the subject line "airplane general - airplane systems modification
for higher altitude airfield operation - JT8D-17 series engines." Once stop words, including "airplane" in this domain,
have been removed, 10 terms remain as listed below along with the number of occurrences in the text document
collection.

JT (957)
High (168)
engines (566)
operation (272)
modification (250)
systems (220)
Altitude (89)
General (51)
Series (23)
Airfield (12)



[0050] As the result of manual screening of the documents, the documents having indices 1-7 were independently 
determined to be most relevant to this query. As depicted in Table 3 hereinbelow, the 10 documents having the best 
scores based upon an inverse 2-norm weighted scoring technique and an IDF term weighting factor are listed with the 
most relevant documents indicated with bold face type. Again, the weighted scoring technique has identified the most
relevant documents as the 7 best scoring documents. 

TABLE 3

Rank    Weighted score    IDF
1       3                 240
2       6                 39
3       4                 236
4       7                 42
5       2                 2
6       1                 52
7       5                 45
8       648               232
9       651               235
10      56                43



[0051] Finally, the unique subject lines of 1,026 documents were utilized as separate queries, with a search being
termed a success if the 10 documents having the best scores for a particular query included the bodies of the documents
corresponding to the subject line that formed the query. For comparison purposes, a conventional scoring technique
that treats a query as a pseudo-document was compared with the weighted scoring techniques of the present invention
utilizing inverse infinity-norm, inverse 1-norm and inverse 2-norm scoring techniques, with the results tabulated below.

TABLE 4

Method                   successful queries    success rate
Pseudo-document          741                   72.2%
Inverse infinity-norm    843                   82.2%
Inverse one-norm         846                   82.5%
Inverse two-norm         856                   83.4%
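
The success criterion and the tabulated rates can be checked with a short sketch. The function names are hypothetical; the counts are those reported above, with 1,026 queries in total.

```python
def is_success(ranked_doc_ids, target_doc_id, top=10):
    """A search succeeds if the target document is among the `top` best scores."""
    return target_doc_id in ranked_doc_ids[:top]

def success_rate(successes, total):
    """Success rate as a percentage rounded to one decimal place."""
    return round(100.0 * successes / total, 1)

TOTAL_QUERIES = 1026
reported = {
    "Pseudo-document": 741,
    "Inverse infinity-norm": 843,
    "Inverse one-norm": 846,
    "Inverse two-norm": 856,
}
rates = {name: success_rate(n, TOTAL_QUERIES) for name, n in reported.items()}
```

Dividing each reported count by 1,026 reproduces the tabulated percentages to one decimal place.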



[0052] As illustrated by the foregoing examples, the information retrieval technique of this aspect of the present
invention provides more reliable and accurate results due to the weighting of the terms, i.e., rows, of the subspace
representation $A_k$ that correspond to the terms of the query. In addition, by only computing those rows of the subspace
representation that correspond to the terms of the query and by only weighting those same portions of the subspace
representation, the efficiency of the information retrieval technique of the present invention is also improved
and the ease with which new documents can be added or old documents removed from the text document collection
is enhanced, since the row weighting factors need not be recomputed with each change to the text
document collection. Moreover, by analyzing the query as a set of terms as opposed to a pseudo-document, the
accuracy of the results is further improved since terms that are not included in the query are irrelevant with respect to
the scoring, unlike in conventional scoring techniques that treat a query as a pseudo-document.

[0053] Returning to decision block 202 in Figure 6, in those instances in which the query is treated as a pseudo-
document, the query is processed in much the same fashion as described above and in more detail in the '888
application. In this regard, a query frequency vector is computed and the entries in the query vector are then statistically
transformed according to a statistical transformation policy as shown in blocks 220 and 222, respectively. Thereafter,
the query vector is projected into the $k$-dimensional subspace as shown in block 224. The similarity between the query
vector and the document vectors is then determined by measuring the distance therebetween. See block 226. The
documents can then be scored and may be presented in a ranked order as depicted in block 228. Further details of
this technique are provided by the '888 application, the contents of which have been incorporated by reference. As such,
the method, apparatus and computer program product of the present invention can advantageously support different
types of processing of the query depending upon the nature of the query itself, thereby providing more efficient analysis
of the queries than existing latent semantic indexing methods which always treat a query as a document that is
projected into the same subspace.
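
The pseudo-document path of blocks 224 through 228 can be sketched as follows. Two assumptions are made for illustration: cosine similarity is used as the measure of closeness between the projected query and the projected documents, and a truncated SVD basis stands in for the TURV term subspace. The function name `pseudo_document_scores` is hypothetical.

```python
import numpy as np

def pseudo_document_scores(U_k, doc_coords, query_vec):
    """Cosine similarity between a projected query and the projected documents.

    U_k:        terms x k basis with orthonormal columns
    doc_coords: k x d coordinates of the documents in the subspace
    query_vec:  transformed query frequency vector (one entry per term)
    """
    q_hat = U_k.T @ query_vec                       # project the query (block 224)
    num = doc_coords.T @ q_hat
    den = np.linalg.norm(doc_coords, axis=0) * np.linalg.norm(q_hat)
    den[den == 0] = 1.0                             # assumption: zero vectors score 0
    return num / den                                # similarity per document (block 226)

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 5))                     # toy working matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :3]
doc_coords = U_k.T @ A
scores = pseudo_document_scores(U_k, doc_coords, A[:, 2])  # document 2 used as the query
ranking = np.argsort(-scores, kind="stable")        # ranked order (block 228)
```

Using an indexed document as the query gives that document a similarity of 1, which is a convenient sanity check on the projection.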

[0054] In addition to the information retrieval technique described above and depicted in Figure 6,
documents can be classified into none, one or more of a plurality of predefined classes as shown in Figures 7 and 8.
In this regard, the plurality of predefined classes are defined by a term-by-class matrix with each predefined class
including at least one term. Referring now to the logic of Figure 7 and, in particular, to block 300, a decision is initially
made as to whether the classifier is to be trained, in which case a term-by-class matrix must be constructed, or
whether a term-by-class matrix exists and can be utilized. If a term-by-class matrix is to be constructed,
a training sample is initially generated as depicted in block 304. In this regard and as with any classification method, there is
a training phase in which a training sample is used to determine a classifier and a classification phase that uses this
classifier to determine the manner in which new documents will be classified into classes. According to this aspect of
the present invention, a term-by-class matrix is formed based on a set of documents whose class membership is
already known, i.e., the training sample. See block 306 in general and blocks 151, 153 and 155 of Figure 3B for
more detail. The entries of this matrix are the frequencies of the terms in the documents that belong to a given class.
Following transformation of the matrix entries, a subspace representation of the classes is then generated from the matrix by using
a two-sided orthogonal decomposition, analogous to the indexing of a term-by-document matrix $D$ for information
retrieval. See, for example, Figure 3B. This constitutes the training phase of the classifier.

[0055] If a classification operation is to be performed in block 308, the logic proceeds to the classification
operation, the details of which are set forth in Figure 8. As shown in block 210 of Figure 8, a representation
of the new document to be classified is received. The document is represented as a collection of terms. Those portions
of the subspace representation of the term-by-class matrix that relate to the terms of the document to be classified
are then weighted, such as by the inverse infinity norm, inverse 1-norm or inverse 2-norm weighting techniques
described above. The relationship of the new document to each predefined class is then scored in the
same manner described above in conjunction with the scoring of a query relative to a plurality of documents, as also
shown in Figure 8. The new document can then be classified into none, one or more of the plurality of predefined classes based
upon the scores of the relationship of the new document to each predefined class. See block 218. In this regard, the
new document is typically classified into each predefined class for which the respective score meets a predetermined
criterion. As such, the techniques described in more detail in conjunction with information retrieval can also be applied
in an analogous fashion to the classification of documents into a plurality of predefined classes without departing from
the spirit and scope of the present invention. Accordingly, the document classification technique of this aspect of the
present invention also provides comparable advantages in terms of efficiency, reliability and accuracy as further
described above in conjunction with the information retrieval aspect of the present invention.
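
The classification phase can be sketched in the same way as the weighted retrieval scoring. In the sketch below, the rows of the class subspace representation for the terms of the new document are weighted by an inverse norm, summed into one score per class, and compared with a threshold. The function name `classify`, the numeric threshold standing in for the "predetermined criterion", and the toy data are all assumptions.

```python
import numpy as np

def classify(class_rows, threshold, ord=2):
    """Score a document against each class and keep classes meeting the threshold.

    class_rows: rows of the class subspace representation for the terms of
                the new document (n_terms x n_classes)
    ord:        norm used for the inverse-norm row weights (1, 2 or np.inf)
    """
    class_rows = np.asarray(class_rows, dtype=float)
    norms = np.linalg.norm(class_rows, ord=ord, axis=1)
    safe = np.where(norms > 0, norms, 1.0)
    weights = np.where(norms > 0, 1.0 / safe, 0.0)   # zero rows get weight 0
    scores = (class_rows * weights[:, None]).sum(axis=0)
    classes = [c for c, s in enumerate(scores) if s >= threshold]
    return classes, scores

# Invented rows for a two-term document and three predefined classes.
rows = np.array([[0.9, 0.1, 0.0],
                 [0.8, 0.0, 0.6]])
classes, scores = classify(rows, threshold=0.5)
none_matched, _ = classify(rows, threshold=2.0)
```

With a sufficiently strict threshold the document is assigned to no class at all, which corresponds to the "none, one or more" outcome described above.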

[0056] Figures 1-8 are block diagram, flowchart and control flow illustrations of methods, apparatus and computer 
program products according to the invention. It will be understood that each block or step of the block diagram, flowchart 
and control flow illustrations, and combinations of blocks in the block diagram, flowchart and control flow illustrations, 
can be implemented by computer program instructions or other means. Although computer program instructions are
discussed hereinbelow, for example, an apparatus according to the present invention can include other means, such
as hardware or some combination of hardware and software, including one or more processors or controllers for
performing the information retrieval and/or document classification.

[0057] In this regard, Figure 11 depicts the apparatus of one embodiment including several of the key components
of a general purpose computer 50 on which the present invention may be implemented. Those of ordinary skill in the
art will appreciate that a computer includes many more components than those shown in Figure 11. However, it is not
necessary that all of these generally conventional components be shown in order to disclose an illustrative embodiment
for practicing the invention. The computer 50 includes a processing unit 60 and a system memory 62 which includes
random access memory (RAM) and read-only memory (ROM). The computer also includes nonvolatile storage 64,
such as a hard disk drive, where data is stored. The apparatus of the present invention can also include one or more
input devices 68, such as a mouse, keyboard, etc. A display 66 is provided for viewing text mining data, and interacting
with a user interface to request text mining operations. The apparatus of the present invention may be connected to
one or more remote computers 70 via a network interface 72. The connection may be over a local area network (LAN)
or wide area network (WAN), and includes all of the necessary circuitry for such a connection. In one embodiment of the
present invention, the document collection includes documents on an Intranet. Other embodiments are possible,
including: a local document collection, i.e., all documents on one computer, documents stored on a server and/or a client
in a network environment, etc.

[0058] Typically, computer program instructions may be loaded onto the computer or other programmable apparatus 
to produce a machine, such that the instructions which execute on the computer or other programmable apparatus 
create means for implementing the functions specified in the block diagram, flowchart or control flow block(s) or
step(s). These computer program instructions may also be stored in a computer-readable memory that can direct a computer
or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-
readable memory produce an article of manufacture including instruction means which implement the function specified
in the block diagram, flowchart or control flow block(s) or step(s). The computer program instructions may also be
loaded onto the computer or other programmable apparatus to cause a series of operational steps to be performed on
the computer or other programmable apparatus to produce a computer implemented process such that the instructions
which execute on the computer or other programmable apparatus provide steps for implementing the functions specified 
in the block diagram, flowchart or control flow block(s) or step(s). 

[0059] Accordingly, blocks or steps of the block diagram, flowchart or control flow illustrations support combinations 
of means for performing the specified functions, combinations of steps for performing the specified functions and
program instruction means for performing the specified functions. It will also be understood that each block or step of the
block diagram, flowchart or control flow illustrations, and combinations of blocks or steps in the block diagram, flowchart 
or control flow illustrations, can be implemented by special purpose hardware-based computer systems which perform 
the specified functions or steps, or combinations of special purpose hardware and computer instructions. 
[0060] In a preferred embodiment of the system of Figure 11, a method according to the present invention can be
carried out using a web browser type of software by means of which a user logs into a server running the actual text
mining operation. The client running the web browsing software can be any computer (PC, UNIX workstation, etc.).

[0061] After log in, the user enters a main menu in which one can choose between the following choices: query,
create indexes, update indexes, clustering and visualization. Other operations are also possible, such as
trend analysis, classification, etc.

[0062] The main menu option "indexing" is performed by a user submitting data to a designated area that
the server can access and then setting parameters such as "minimum word frequency" or "number of dimensions". Indexing
is then performed in the ways described above while the user is shown a progress indication such as
an hourglass. Indexing needs to be performed whenever a new data domain is used. Indexes need to be updated
whenever a data domain is changed.
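
As an illustration of how an indexing parameter such as "minimum word frequency" might act, the following is a minimal sketch and not the indexing procedure of the invention. The function name, the whitespace tokenization and the reading of the parameter as a collection-wide count are all assumptions made for the example. The "number of dimensions" parameter is not used here; it would select the size k of the subspace in a later projection step.

```python
from collections import Counter

def build_term_document_matrix(documents, min_word_frequency=2):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j.

    Hypothetical helper: tokenization is a naive whitespace split, and
    min_word_frequency is assumed to be a total count over the collection.
    """
    tokenized = [doc.lower().split() for doc in documents]
    totals = Counter(word for tokens in tokenized for word in tokens)
    # Keep only terms that occur often enough across the whole collection.
    terms = sorted(t for t, n in totals.items() if n >= min_word_frequency)
    counts = [Counter(tokens) for tokens in tokenized]
    # One row per retained term, one column per document.
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

docs = ["engine fault engine", "fault report", "landing gear report"]
terms, matrix = build_term_document_matrix(docs, min_word_frequency=2)
```

With this toy input, "landing" and "gear" occur only once and are dropped, leaving a three-term matrix.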

[0063] After indexing, a user can choose the option "visualization" in the main menu. When, within the visualization
model, the user chooses a domain for which an index has been made, the user can enter "principle dimensions to
display", which can be done using three axes. The user can also enter "user supplied terms of interest", each of which can
correlate positively or negatively with one of the axes. If three axes are being used,
six "user supplied terms of interest" can be entered, which are either single-word terms or multiple-word terms.
Hereafter a user can visualize correlations in a space defined by the axes. By such a graphic presentation a
user can get a high level view of the data visually.

[0064] Another option of the main menu is "clustering". By clicking this option in the main menu a user enters the
clustering screen, in which a user can enter a minimum and maximum number of clusters, thereby entering
a range. Furthermore, a size of the clustering subspace can be entered or selected from a pull-down menu.
Hereafter, by clicking a button labeled "cluster", the actual clustering is performed.

[0065] The result of this clustering operation is returned as a list of clusters. Each cluster is summarized by topic
words, with a value associated with each topic word indicating its importance level. By clicking a button that
is related to the cluster on screen a user can further view all the documents in each cluster, and the result is shown in
a detail view called "clustering details".

[0066] The clustering details view comprises, amongst other objects, a matrix comprising four columns. Column 1
identifies the documents, column 2 comprises the document title, column 3 comprises boxes for selecting
documents as samples, and column 4 comprises a button for showing a summary of the document. Another option of the title column
is that a user can view a complete document by clicking on the title, which for this purpose is hyperlinked. Furthermore,
a user can save the clusters or selections thereof as a class. The system can then use the samples
and notify the user of new documents which fit into such a class. If a user does not mark individual documents as
samples, the complete set of documents in this cluster can be used as a default sample set.

[0067] Another option in the main menu is "perform a query". When this option is clicked the user enters a "perform
query" screen. This screen comprises a pull-down menu for selecting a domain, a query portion for typing query text,
a selector for the type of query, including a query by example, a portion for entering the
size of a subspace, a portion for entering the maximum number of documents to return, and a button for
starting the query.

[0068] Hereafter a listing of the query results will appear in a matrix format. The four columns of this matrix comprise,
amongst others, the title of the document, the similarity of the document and a column comprising buttons
for performing a query by example. Each document can be viewed by clicking on the title of the document, which is
provided with a hyperlink. Another option of such a result window could be to provide the option of showing a summary
of each document as is described in the above.

[0069] Many modifications and other embodiments of the invention will come to mind to one skilled in the art to which
this invention pertains having the benefit of the teachings presented in the foregoing descriptions and the associated
drawings. Therefore, it is to be understood that the invention is not to be limited to the specific embodiments disclosed
and that modifications and other embodiments are intended to be included within the scope of the appended claims.
Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes
of limitation.



Claims 



1. A method of retrieving information from a text data collection that comprises a plurality of documents with each
document comprised of a plurality of terms, wherein the text data collection is represented by a term-by-document
matrix having a plurality of entries with each entry being the frequency of occurrence of a term in a respective
document, and wherein the method comprises:

receiving a query;

projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional
subspace to thereby create at least those portions of a subspace representation A_k relating to a term identified
by the query;

weighting at least those portions of a subspace representation A_k relating to a term identified by the query
following the projection into the lower dimensional subspace; and

scoring the plurality of documents with respect to the query based at least partially upon the weighted portion
of the subspace representation A_k.

2. A method according to Claim 1 wherein the subspace representation A_k includes a plurality of rows corresponding
to respective terms, and wherein said weighting comprises determining an inverse infinity norm of the term.

3. A method according to Claim 1 wherein the subspace representation A_k includes a plurality of rows corresponding
to respective terms, and wherein said weighting comprises determining an inverse 1-norm of the term.

4. A method according to Claim 1 wherein the subspace representation A_k includes a plurality of rows corresponding
to respective terms, and wherein said weighting comprises determining an inverse 2-norm of the term.
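
The three weighting variants of Claims 2 to 4 can be illustrated with a minimal sketch, assuming the weight of a term is the reciprocal of the chosen norm of that term's row in A_k. The function names, the zero weight for an all-zero row, and the scoring rule (summing the weighted rows of the query terms) are assumptions made for the example rather than details taken from the claims.

```python
import math

def inverse_norm_weight(row, norm="2"):
    """Return 1/||row|| for the chosen norm, or 0.0 for an all-zero row."""
    magnitudes = [abs(x) for x in row]
    if norm == "inf":
        size = max(magnitudes)                        # infinity norm
    elif norm == "1":
        size = sum(magnitudes)                        # 1-norm
    elif norm == "2":
        size = math.sqrt(sum(m * m for m in magnitudes))  # 2-norm
    else:
        raise ValueError("norm must be 'inf', '1' or '2'")
    return 1.0 / size if size > 0 else 0.0

def score_documents(a_k, query_rows, norm="2"):
    """Assumed scoring rule: sum the weighted A_k rows of the query terms."""
    n_docs = len(a_k[0])
    scores = [0.0] * n_docs
    for i in query_rows:
        w = inverse_norm_weight(a_k[i], norm)
        for j in range(n_docs):
            scores[j] += w * a_k[i][j]
    return scores

# Hypothetical subspace representation: 2 terms (rows) by 3 documents (columns).
a_k = [[3.0, 4.0, 0.0],
       [0.5, 0.0, 0.5]]
scores = score_documents(a_k, query_rows=[0, 1], norm="inf")
```

Because each row is divided by its own norm, a term with uniformly large entries does not dominate the score merely because of its scale.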

5. A method according to Claim 1 further comprising weighting the term-by-document matrix on a document-by- 
document basis prior to the projection into the lower dimensional subspace. 

6. A method according to Claim 1 wherein the projection into the lower dimensional subspace comprises obtaining
an orthogonal decomposition of the representation of the term-by-document matrix into a k-dimensional subspace.

7. A method according to Claim 1 further comprising identifying respective documents based upon relative scores 
of the documents with respect to the query. 

8. A method of classifying a document with respect to a plurality of predefined classes defined by a term-by-class 
matrix with each predefined class including at least one term, wherein the method comprises: 

receiving a representation of the document to be classified; 

projecting a representation of at least a portion of the term-by-class matrix into a lower dimensional subspace
to thereby create at least those portions of a subspace representation A_k relating to a term included within the
representation of the document to be classified;

weighting at least those portions of the subspace representation A_k relating to a term included within the
representation of the document to be classified following the projection into the lower dimensional subspace;

scoring the relationship of the document to each predefined class based at least partially upon the weighted
portion of the subspace representation A_k; and

determining if the document is to be classified into any of the plurality of predefined classes based upon the 
scores of the relationship of the document to each predefined class. 
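
The last two steps of Claim 8 can be sketched as follows, under stated assumptions: the rows of the term-by-class subspace representation are taken as already weighted, a class score is the sum of the weighted entries for the terms in the document, and a document joins every class whose score meets a fixed threshold. The function names, the threshold rule and the class names are hypothetical. The claim only requires that the decision be based on the scores.

```python
def score_classes(weighted_a_k, class_names, doc_term_rows):
    """Sum the weighted rows for the document's terms; one score per class."""
    scores = {name: 0.0 for name in class_names}
    for i in doc_term_rows:
        for j, name in enumerate(class_names):
            scores[name] += weighted_a_k[i][j]
    return scores

def assign_classes(scores, threshold):
    """Return the classes scoring at or above threshold, best first.

    The result may be empty: a document can belong to none of the classes.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]

# Hypothetical weighted representation: 3 terms (rows) by 2 classes (columns).
weighted_a_k = [[0.9, 0.1],
                [0.8, 0.0],
                [0.0, 0.7]]
class_names = ["hydraulics", "avionics"]
scores = score_classes(weighted_a_k, class_names, doc_term_rows=[0, 1])
assigned = assign_classes(scores, threshold=1.0)
```

A threshold leaves room for the "none, one or more" outcome, since the number of assigned classes is not fixed in advance.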

9. A method according to Claim 8 wherein the subspace representation A_k includes a plurality of rows corresponding
to respective terms, and wherein said weighting comprises determining an inverse infinity norm of the term.

10. A method according to Claim 8 wherein the subspace representation A_k includes a plurality of rows corresponding
to respective terms, and wherein said weighting comprises determining an inverse 1-norm of the term.

11. A method according to Claim 8 wherein the subspace representation A_k includes a plurality of rows corresponding
to respective terms, and wherein said weighting comprises determining an inverse 2-norm of the term.

12. A method according to Claim 8 further comprising weighting the term-by-class matrix on a class-by-class basis
prior to the projection into the lower dimensional subspace.

13. A method according to Claim 8 wherein the projection into the lower dimensional subspace comprises obtaining
an orthogonal decomposition of the representation of the term-by-class matrix into a k-dimensional subspace. 

14. A method of retrieving information from a text data collection that comprises a plurality of documents with each
document comprised of a plurality of terms, wherein the text data collection is represented by a term-by-document
matrix having a plurality of entries with each entry being the frequency of occurrence of a term in a respective
document, and wherein the method comprises: 

receiving a query; 

determining if the query is to be treated as a pseudo-document or as a set of terms; 

processing the query in different manners depending upon the treatment of the query as a pseudo-document 
or as a set of terms; and 

scoring the plurality of documents with respect to the query based upon said processing of the query. 
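
The control flow of Claim 14 can be sketched as a simple dispatch. The claim does not state here how the treatment of the query is determined, so the length-based rule below, the function names and the default cut-off are assumptions made purely for illustration. The two scoring callables stand in for the term-based and pseudo-document-based processing.

```python
def treat_as_pseudo_document(query_terms, max_term_query_length=5):
    """Assumed heuristic: long queries are handled like documents."""
    return len(query_terms) > max_term_query_length

def process_query(query, score_as_terms, score_as_pseudo_document,
                  max_term_query_length=5):
    """Route the query to one of two processing paths and return its result."""
    terms = query.lower().split()
    if treat_as_pseudo_document(terms, max_term_query_length):
        return "pseudo-document", score_as_pseudo_document(terms)
    return "terms", score_as_terms(terms)

# Stub scorers that only report how many terms they received.
short = process_query("engine fault", len, lambda t: -len(t))
long_ = process_query("the left engine reported a fault after landing",
                      len, lambda t: -len(t))
```

Passing the scorers in as arguments keeps the routing decision separate from the two processing paths.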

15. A method according to Claim 14 wherein the processing of the query in instances in which the query is treated
as a set of terms comprises:

projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional
subspace to thereby create at least those portions of a subspace representation A_k corresponding to a term
identified by the query; and

weighting at least those portions of a subspace representation A_k corresponding to a term identified by the
query following the projection into the lower dimensional subspace,

and wherein said scoring comprises scoring the plurality of documents with respect to the query based at least
partially upon the weighted portion of the subspace representation A_k.

16. A method according to Claim 14 wherein the processing of the query in instances in which the query is treated 
as a pseudo-document comprises: 

projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional sub- 
space; 

projecting a query vector representative of the query into the lower dimensional subspace; and

comparing the projection of the query vector and the representation of at least a portion of the
term-by-document matrix,

and wherein said scoring comprises scoring the plurality of documents with respect to the query based at least
partially upon the comparison of the projection of the query vector and the representation of at least a portion
of the term-by-document matrix.
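
The pseudo-document path of Claim 16 can be sketched as below, assuming an orthonormal basis for the k-dimensional subspace is already available and that the comparison is a cosine similarity between projected vectors. The claim itself does not name the comparison measure. The basis here is hand-picked for the example rather than computed by a decomposition, and all names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(vector, basis):
    """Coordinates of a term-space vector in the subspace spanned by basis."""
    return [dot(b, vector) for b in basis]

def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu > 0 and nv > 0 else 0.0

def score_pseudo_document(query_vector, doc_vectors, basis):
    """Project the query and each document, then compare them by cosine."""
    q = project(query_vector, basis)
    return [cosine(q, project(d, basis)) for d in doc_vectors]

# Hypothetical 3-term collection and an orthonormal basis with k = 2.
s = 1 / math.sqrt(2)
basis = [[s, s, 0.0], [0.0, 0.0, 1.0]]
docs = [[2.0, 2.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
query = [1.0, 0.0, 0.0]          # the query mentions only the first term
scores = score_pseudo_document(query, docs, basis)
```

In this example the first basis vector ties the first two terms together, so the query matches the first document fully even though it names only one of that document's two terms.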

17. A computer program product for retrieving information from a text data collection that comprises a plurality of 
documents with each document comprised of a plurality of terms, wherein the text data collection is represented 
by a term-by-document matrix having a plurality of entries with each entry being the frequency of occurrence of a 
term in a respective document, wherein the computer program product comprises a computer-readable storage 
medium having computer-readable program code means embodied in said medium, and wherein said
computer-readable program code means comprises:

first computer-readable program code means for receiving a query; 

second computer-readable program code means for projecting a representation of at least a portion of the
term-by-document matrix into a lower dimensional subspace to thereby create at least those portions of a
subspace representation A_k relating to a term identified by the query;

third computer-readable program code means for weighting at least those portions of a subspace representation
A_k relating to a term identified by the query following the projection into the lower dimensional subspace; and

fourth computer-readable program code means for scoring the plurality of documents with respect to the query
based at least partially upon the weighted portion of the subspace representation A_k.

18. A computer program product according to Claim 17 wherein the subspace representation A_k includes a plurality
of rows corresponding to respective terms, and wherein said third computer-readable program code means
determines an inverse infinity norm of the term.

19. A computer program product according to Claim 17 wherein the subspace representation A_k includes a plurality
of rows corresponding to respective terms, and wherein said third computer-readable program code means
determines an inverse 1-norm of the term.

20. A computer program product according to Claim 17 wherein the subspace representation A_k includes a plurality
of rows corresponding to respective terms, and wherein said third computer-readable program code means
determines an inverse 2-norm of the term.

21. A computer program product according to Claim 17 further comprising fifth computer-readable program code
means for weighting the term-by-document matrix on a document-by-document basis prior to the projection into
the lower dimensional subspace.

22. A computer program product according to Claim 17 wherein said second computer-readable program code
means obtains an orthogonal decomposition of the representation of the term-by-document matrix into a
k-dimensional subspace.

23. A computer program product according to Claim 17 further comprising sixth computer-readable program code
means for identifying respective documents based upon relative scores of the documents with respect to the query.

24. A computer program product for classifying a document with respect to a plurality of predefined classes defined
by a term-by-class matrix with each predefined class including at least one term, wherein the computer program
product comprises a computer-readable storage medium having computer-readable program code means
embodied in said medium, and wherein said computer-readable program code means comprises:

first computer-readable program code means for receiving a representation of the document to be classified;

second computer-readable program code means for projecting a representation of at least a portion of the
term-by-class matrix into a lower dimensional subspace to thereby create at least those portions of a subspace
representation A_k relating to a term included within the representation of the document to be classified;

third computer-readable program code means for weighting at least those portions of the subspace representation
A_k relating to a term included within the representation of the document to be classified following the
projection into the lower dimensional subspace;

fourth computer-readable program code means for scoring the relationship of the document to each predefined
class based at least partially upon the weighted portion of the subspace representation A_k; and

fifth computer-readable program code means for determining if the document is to be classified into any of
the plurality of predefined classes based upon the scores of the relationship of the document to each predefined
class.

25. A computer program product according to Claim 24 wherein the subspace representation A_k includes a plurality
of rows corresponding to respective terms, and wherein said third computer-readable program code means
determines an inverse infinity norm of the term.

26. A computer program product according to Claim 24 wherein the subspace representation A_k includes a plurality
of rows corresponding to respective terms, and wherein said third computer-readable program code means
determines an inverse 1-norm of the term.

27. A computer program product according to Claim 24 wherein the subspace representation A_k includes a plurality
of rows corresponding to respective terms, and wherein said third computer-readable program code means
determines an inverse 2-norm of the term.

28. A computer program product according to Claim 24 further comprising sixth computer-readable program code
means for weighting the term-by-class matrix on a class-by-class basis prior to the projection into the lower
dimensional subspace.

29. A computer program product according to Claim 24 wherein said second computer-readable program code
means obtains an orthogonal decomposition of the representation of the term-by-class matrix into a k-dimensional
subspace.

30. A computer program product for retrieving information from a text data collection that comprises a plurality of
documents with each document comprised of a plurality of terms, wherein the text data collection is represented
by a term-by-document matrix having a plurality of entries with each entry being the frequency of occurrence of a
term in a respective document, wherein the computer program product comprises a computer-readable storage
medium having computer-readable program code means embodied in said medium, and wherein said
computer-readable program code means comprises:

first computer-readable program code means for receiving a query; 

second computer-readable program code means for determining if the query is to be treated as a pseudo- 
document or as a set of terms; 

third computer-readable program code means for processing the query in different manners depending upon 
the treatment of the query as a pseudo-document or as a set of terms; and 

fourth computer-readable program code means for scoring the plurality of documents with respect to the query 
based upon said processing of the query. 

31. A computer program product according to Claim 30 wherein said third computer-readable program code means 
comprises: 

fifth computer-readable program code means, operable in instances in which the query is treated as a set of
terms, for projecting a representation of at least a portion of the term-by-document matrix into a lower
dimensional subspace to thereby create at least those portions of a subspace representation A_k corresponding to
a term identified by the query; and

sixth computer-readable program code means, also operable in instances in which the query is treated as a
set of terms, for weighting at least those portions of a subspace representation A_k corresponding to a term
identified by the query following the projection into the lower dimensional subspace,

and wherein said fourth computer-readable program code means scores the plurality of documents with
respect to the query based at least partially upon the weighted portion of the subspace representation A_k in
instances in which the query is treated as a set of terms.

32. A computer program product according to Claim 30 wherein said third computer-readable program code means
comprises:

fifth computer-readable program code means, operable in instances in which the query is treated as a pseudo- 
document, for projecting a representation of at least a portion of the term-by-document matrix into a lower 
dimensional subspace; 

sixth computer-readable program code means, also operable in instances in which the query is treated as a 
pseudo-document, for projecting a query vector representative of the query into the lower dimensional sub- 
space; and 

seventh computer-readable program code means, further operable in instances in which the query is treated
as a pseudo-document, for comparing the projection of the query vector and the representation of at least a
portion of the term-by-document matrix,

and wherein said fourth computer-readable program code means scores the plurality of documents with
respect to the query based at least partially upon the comparison of the projection of the query vector and the
representation of at least a portion of the term-by-document matrix in instances in which the query is treated
as a pseudo-document.

33. An apparatus for retrieving information from a text data collection that comprises a plurality of documents with 
each document comprised of a plurality of terms, wherein the text data collection is represented by a term-by- 
document matrix having a plurality of entries with each entry being the frequency of occurrence of a term in a 
respective document, and wherein the apparatus comprises: 

means for receiving a query; 

means for projecting a representation of at least a portion of the term-by-document matrix into a lower
dimensional subspace to thereby create at least those portions of a subspace representation A_k relating to a term
identified by the query;

means for weighting at least those portions of the subspace representation A_k relating to a term identified by
the query following the projection into the lower dimensional subspace; and

means for scoring the plurality of documents with respect to the query based at least partially upon the weighted
portion of the subspace representation A_k.



34. An apparatus according to Claim 33 wherein the subspace representation A_k includes a plurality of rows
corresponding to respective terms, and wherein said means for weighting comprises means for determining an inverse
infinity norm of the term.

35. An apparatus according to Claim 33 wherein the subspace representation A_k includes a plurality of rows
corresponding to respective terms, and wherein said means for weighting comprises means for determining an inverse
1-norm of the term.

36. An apparatus according to Claim 33 wherein the subspace representation A_k includes a plurality of rows
corresponding to respective terms, and wherein said means for weighting comprises means for determining an inverse
2-norm of the term.

37. An apparatus according to Claim 33 further comprising means for weighting the term-by-document matrix on
a document-by-document basis prior to the projection into the lower dimensional subspace.

38. An apparatus according to Claim 33 wherein said means for projecting a representation of at least a portion
of the term-by-document matrix into a lower dimensional subspace comprises means for obtaining an orthogonal
decomposition of the representation of the term-by-document matrix into a k-dimensional subspace.

39. An apparatus according to Claim 33 further comprising means for identifying respective documents based
upon relative scores of the documents with respect to the query.

40. An apparatus for classifying a document with respect to a plurality of predefined classes defined by a term-by- 
class matrix with each predefined class including at least one term, wherein the apparatus comprises: 

means for receiving a representation of the document to be classified; 

means for projecting a representation of at least a portion of the term-by-class matrix into a lower dimensional
subspace to thereby create at least those portions of a subspace representation A_k relating to a term included
within the representation of the document to be classified;

means for weighting at least those portions of the subspace representation A_k relating to a term included within
the representation of the document to be classified following the projection into the lower dimensional
subspace;

means for scoring the relationship of the document to each predefined class based at least partially upon the
weighted portion of the subspace representation A_k; and

means for determining if the document is to be classified into any of the plurality of predefined classes based 
upon the scores of the relationship of the document to each predefined class. 

41. An apparatus according to Claim 40 wherein the subspace representation A_k includes a plurality of rows
corresponding to respective terms, and wherein said means for weighting comprises means for determining an inverse
infinity norm of the term.

42. An apparatus according to Claim 40 wherein the subspace representation A_k includes a plurality of rows
corresponding to respective terms, and wherein said means for weighting comprises means for determining an inverse
1-norm of the term.

43. An apparatus according to Claim 40 wherein the subspace representation A_k includes a plurality of rows
corresponding to respective terms, and wherein said means for weighting comprises means for determining an inverse
2-norm of the term.

44. An apparatus according to Claim 40 further comprising means for weighting the term-by-class matrix on a
class-by-class basis prior to the projection into the lower dimensional subspace.

45. An apparatus according to Claim 40 wherein said means for projecting a representation of at least a portion
of the term-by-class matrix into a lower dimensional subspace comprises means for obtaining an orthogonal
decomposition of the representation of the term-by-class matrix into a k-dimensional subspace.

46. An apparatus for retrieving information from a text data collection that comprises a plurality of documents with 
each document comprised of a plurality of terms, wherein the text data collection is represented by a term-by- 
document matrix having a plurality of entries with each entry being the frequency of occurrence of a term in a 
respective document, and wherein the apparatus comprises: 

means for receiving a query; 

means for determining if the query is to be treated as a pseudo-document or as a set of terms;

means for processing the query in different manners depending upon the treatment of the query as a
pseudo-document or as a set of terms; and

means for scoring the plurality of documents with respect to the query based upon said processing of the query.

47. An apparatus according to Claim 46 wherein said means for processing comprises: 

means, operable in instances in which the query is treated as a set of terms, for projecting a representation
of at least a portion of the term-by-document matrix into a lower dimensional subspace to thereby create at
least those portions of a subspace representation A_k corresponding to a term identified by the query; and

means, also operable in instances in which the query is treated as a set of terms, for weighting at least those
portions of a subspace representation A_k corresponding to a term identified by the query following the projection
into the lower dimensional subspace,

and wherein said means for scoring scores the plurality of documents with respect to the query based at least
partially upon the weighted portion of the subspace representation A_k in instances in which the query is treated
as a set of terms.

48. An apparatus according to Claim 46 wherein said means for processing comprises:

means, operable in instances in which the query is treated as a pseudo-document, for projecting a
representation of at least a portion of the term-by-document matrix into a lower dimensional subspace;

means, also operable in instances in which the query is treated as a pseudo-document, for projecting a query
vector representative of the query into the lower dimensional subspace; and

means, further operable in instances in which the query is treated as a pseudo-document, for comparing the
projection of the query vector and the representation of at least a portion of the term-by-document matrix,

and wherein said means for scoring scores the plurality of documents with respect to the query based at least
partially upon the comparison of the projection of the query vector and the representation of at least a portion
of the term-by-document matrix in instances in which the query is treated as a pseudo-document.

49. A method according to one or more of the preceding claims in which a client-server system is used in which a
user operating on a client computing device receives information from a text data collection which is stored on
a server computing device.

50. A method according to Claim 49 in which the client computing device operates web browsing software
for executing the method in cooperation with web server software operating on the server.